We investigate statistical properties for a broad class of modern kernel-based regression (KBR)
methods. These kernel methods were developed during the last decade and are inspired by convex 
risk minimization in infinite-dimensional Hilbert spaces. One leading example is support vector 
regression. We first describe the relationship between the loss function L of the KBR method 
and the tail of the response variable. We then establish the L-risk consistency for KBR which 
gives the mathematical justification for the statement that these methods are able to "learn".
Then we consider robustness properties of such kernel methods. In particular, our results allow 
us to choose the loss function and the kernel to obtain computationally tractable and consistent 
KBR methods that have bounded influence functions. Furthermore, bounds for the bias and for 
the sensitivity curve, which is a finite sample version of the influence function, are developed, 
and the relationship between KBR and classical M estimators is discussed. 
1. Introduction

The goal in nonparametric regression is to estimate a functional relationship between an
$\mathbb{R}^d$-valued input random variable $X$ and an $\mathbb{R}$-valued output random variable $Y$, under
the assumption that the joint distribution $P$ of $(X, Y)$ is (almost) completely unknown. To
solve this problem, one typically assumes a set of observations $(x_i, y_i)$ from independent
and identically distributed (i.i.d.) random variables $(X_i, Y_i)$, $i = 1, \dots, n$, which all have
the distribution $P$. Informally, the aim is to build a predictor $f : \mathbb{R}^d \to \mathbb{R}$ on the basis of
these observations such that $f(X)$ is a "good" approximation of $Y$. To formalize this aim,
one assumes a loss $L$, that is, a continuous function $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$, that assesses the
quality of a prediction $f(x)$ for an observed output $y$ by $L(y, f(x))$. Here it is commonly
assumed that the smaller $L(y, f(x))$ is, the better the prediction is. The quality of a
predictor $f$ is then measured by the $L$-risk

$$\mathcal{R}_{L,P}(f) := E_P L(Y, f(X)), \qquad (1)$$

that is, by the average loss obtained by predicting with $f$. Following the interpretation
that a small loss is desired, one tries to find a predictor with risk close to the optimal
risk $\mathcal{R}^*_{L,P} := \inf\{\mathcal{R}_{L,P}(f) \mid f : \mathbb{R}^d \to \mathbb{R} \text{ measurable}\}$. Traditionally, most research in
nonparametric regression considered the least squares loss $L(y, t) := (y - t)^2$, mainly because
it "simplifies the mathematical treatment" and "leads naturally to estimates which can
be computed rapidly" as Györfi et al. [12], page 2, wrote. However, from a practical point
of view, there are situations in which a different loss is more appropriate, for example:

Regression problem not described by least squares loss. It is well known that
the least squares risk is minimized by the conditional mean $E_P(Y \mid X = x)$. However, in
many situations one is actually not interested in this mean, but in, for example, the
conditional median instead. Now recall that the conditional median is the minimizer
of $\mathcal{R}_{L,P}$, where $L$ is the absolute value loss, that is, $L(y, t) := |y - t|$, and the same
statement holds for conditional quantiles if one replaces the absolute value loss by
an asymmetric variant known as the pinball loss; see Steinwart [24] for approximate
minimizers and Christmann and Steinwart [6] for kernel-based quantile regression.

Surrogate losses. If the conditional distributions of $Y \mid X = x$ are known to be symmetric,
then basically all loss functions of the form $L(y, t) = l(y - t)$, where $l : \mathbb{R} \to [0, \infty)$
is convex, symmetric and has its only minimum at 0, can be used to estimate the
conditional mean; see Steinwart [24]. In this case, a less steep surrogate such as the absolute
value loss, Huber's loss (given by $l(r) = r^2/2$ if $|r| \le c$, and $l(r) = c|r| - c^2/2$ if $|r| > c$
for some $c \in (0, \infty)$), or the logistic loss (given by $l(r) = -\log\{4\Lambda(r)[1 - \Lambda(r)]\}$, where
$\Lambda(r) := 1/(1 + e^{-r})$) may be more suitable if one expects outliers in the $y$ direction.

Algorithmic aspects. If the goal is to estimate the conditional median, then Vapnik's
$\varepsilon$-insensitive loss, given by $l(r) = \max\{|r| - \varepsilon, 0\}$, $\varepsilon \in (0, \infty)$, promises algorithmic
advantages in terms of sparseness compared to the absolute loss when it is used in the
kernel-based regression method described below. This is especially important for large
data sets, which commonly occur in data mining.
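
The invariant losses named in the three cases above can be written down directly as functions of the residual $r = y - t$. The following is a minimal sketch, not taken from the source: the function names and default parameter values are illustrative, Huber's loss is taken in its continuous form $r^2/2$ for $|r| \le c$, and the pinball loss is parametrized by an assumed quantile level `tau`.

```python
import math


def huber_loss(r, c=1.0):
    """Huber's loss: quadratic for |r| <= c, linear beyond (continuous at |r| = c)."""
    return 0.5 * r * r if abs(r) <= c else c * abs(r) - 0.5 * c * c


def logistic_loss(r):
    """Logistic loss l(r) = -log(4 * Lambda(r) * (1 - Lambda(r))).

    Evaluated through the equivalent form |r| + 2*log(1 + exp(-|r|)) - log(4),
    which avoids overflow for large |r|.
    """
    a = abs(r)
    return a + 2.0 * math.log1p(math.exp(-a)) - math.log(4.0)


def eps_insensitive_loss(r, eps=0.1):
    """Vapnik's epsilon-insensitive loss max(|r| - eps, 0)."""
    return max(abs(r) - eps, 0.0)


def pinball_loss(r, tau=0.5):
    """Pinball loss for the residual r = y - t (tau is an assumed quantile level)."""
    return tau * r if r >= 0 else (tau - 1.0) * r


def least_squares_loss(r):
    """Least squares loss r^2, shown for comparison."""
    return r * r
```

For a residual of 100, the least squares loss equals 10000 while Huber's loss with $c = 1$ equals 99.5, which illustrates why the less steep surrogates are attractive when outliers in the $y$ direction are expected.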

One way to build a nonparametric predictor $f$ is to use kernel-based regression (KBR),
which finds a minimizer $f_{n,\lambda}$ of the regularized empirical risk

$$\frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \|f\|_H^2, \qquad f \in H, \qquad (2)$$

where $\lambda > 0$ is a regularization parameter to reduce the danger of overfitting, $H$ is a
reproducing kernel Hilbert space (RKHS) of a kernel $k : X \times X \to \mathbb{R}$ and $L$ is a convex
loss in the sense that $L(y, \cdot) : \mathbb{R} \to [0, \infty)$ is convex for all $y \in Y$. Because (2) is strictly
convex in $f$, the minimizer $f_{n,\lambda}$ is uniquely determined and a simple gradient descent
algorithm can be used to find $f_{n,\lambda}$. However, for specific losses such as the least squares
or the $\varepsilon$-insensitive loss, other, more efficient algorithmic approaches are used in practice;
see Wahba [29], Vapnik [28], Schölkopf and Smola [20] and Suykens et al. [26].
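
The remark that a simple gradient descent algorithm can minimize the regularized empirical risk (2) can be illustrated by a minimal sketch. It is not the authors' implementation: it assumes the logistic loss, a Gaussian RBF kernel on one-dimensional inputs, and a predictor of the form $f = \sum_j \alpha_j k(\cdot, x_j)$; the names `fit_kbr` and `regularized_risk` and the parameters `step` and `iters` are illustrative. Each update is a gradient step in $H$, which in terms of the coefficients reads $\alpha_i \leftarrow \alpha_i - \text{step} \cdot (2\lambda\alpha_i - l'(y_i - f(x_i))/n)$.

```python
import math


def gaussian_kernel(x, z, gamma=0.1):
    """Gaussian RBF kernel on the real line (one-dimensional inputs assumed)."""
    return math.exp(-gamma * (x - z) ** 2)


def logistic_loss(r):
    """Logistic loss in a numerically stable form."""
    a = abs(r)
    return a + 2.0 * math.log1p(math.exp(-a)) - math.log(4.0)


def logistic_loss_deriv(r):
    """Derivative of the logistic loss with respect to the residual r."""
    return math.tanh(0.5 * r)


def regularized_risk(alpha, K, y, lam):
    """Empirical risk plus lam * ||f||_H^2 for f = sum_j alpha_j k(., x_j)."""
    n = len(y)
    f = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    empirical = sum(logistic_loss(y[i] - f[i]) for i in range(n)) / n
    norm_sq = sum(alpha[i] * f[i] for i in range(n))  # alpha' K alpha
    return empirical + lam * norm_sq


def fit_kbr(x, y, lam=0.05, gamma=0.1, step=0.5, iters=500):
    """Gradient descent in H on the regularized empirical risk, started at f = 0."""
    n = len(y)
    K = [[gaussian_kernel(x[i], x[j], gamma) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(iters):
        f = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        alpha = [
            alpha[i] - step * (2.0 * lam * alpha[i] - logistic_loss_deriv(y[i] - f[i]) / n)
            for i in range(n)
        ]
    return alpha, K
```

The step size is a design choice of this sketch: for the logistic loss and a kernel with $k(x, x) \le 1$, the gradient of (2) in $H$ is Lipschitz with constant at most $2\lambda + 1/2$, so a step of 0.5 decreases the objective monotonically for the value $\lambda = 0.05$ used here.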

Of course, when using KBR, one natural question is whether the risk $\mathcal{R}_{L,P}(f_{n,\lambda})$
actually tends to the minimal risk $\mathcal{R}^*_{L,P}$ if $n \to \infty$. For example, if $Y$ is bounded, $H$
is rich in the sense of Definition 11 below and $\lambda = \lambda_n \to 0$ "sufficiently slowly", this
question can be positively answered by current standard techniques based on uniform
deviation inequalities. However, for unbounded $Y$, no such results presently exist. The
first goal of this work is to close this gap (using Theorem 12).

Our second aim is to investigate robustness properties of KBR. To this end we consider

$$\mathcal{R}^{reg}_{L,P,\lambda}(f) := E_P L(Y, f(X)) + \lambda \|f\|_H^2, \qquad f \in H, \qquad (3)$$

which can be interpreted as the infinite sample version of (2). In Section 4, we describe
the influence of both the kernel and the loss function on the robustness of the (uniquely
determined) minimizer $f_{P,\lambda}$ of (3). In particular, we establish the existence of the
influence function for a broad class of KBR methods and present conditions under which
the influence function is bounded. Here it turns out that, depending on the kernel and
the loss function, some KBR methods are robust while others are not. Consequently,
our results can help us to choose both quantities to obtain consistent KBR estimators
with good robustness properties in a situation that allows such freedom. Moreover, it is
interesting, but in some sense not surprising, that the robust KBR methods are exactly
the ones that require the mildest tail conditions on $Y$ for consistency.

2. Some basics on losses, risks and kernel-based regression

In this section we first introduce some important concepts for loss functions that are used
throughout this work. Thereafter, we investigate properties of their associated risks and
discuss the interplay between growth behaviour of the loss functions and the tail of the
response variable. Finally, we establish existence and stability results for the infinite-sample
KBR methods given by (3). These results are needed to obtain both the consistency
results in Section 3 and some of the robustness results in Section 4.

Definition 1. Let $Y \subset \mathbb{R}$ be a non-empty closed subset and let $L : Y \times \mathbb{R} \to [0, \infty)$ be a
measurable loss function. Then $L$ is called invariant if there exists a function $l : \mathbb{R} \to [0, \infty)$
with $l(0) = 0$ and $L(y, t) = l(y - t)$ for all $y \in Y$, $t \in \mathbb{R}$. Moreover, $L$ is called Lipschitz
continuous if there exists a constant $c > 0$ such that

$$|L(y, t) - L(y, t')| \le c \cdot |t - t'| \qquad (4)$$

for all $y \in Y$, $t, t' \in \mathbb{R}$. In this case, we denote the smallest possible $c$ in (4) by $|L|_1$.

Note that an invariant loss function $L$ is convex if and only if the corresponding $l$ is
convex. Analogously, $L$ is Lipschitz continuous if and only if $l$ is Lipschitz continuous;
in this case, we have $|L|_1 = |l|_1$, where $|l|_1$ denotes the Lipschitz constant of $l$. Many
popular loss functions for regression problems are convex, Lipschitz continuous and
invariant. Three important examples are the Vapnik $\varepsilon$-insensitive loss, the Huber loss and
the logistic loss functions introduced in the Introduction. Moreover, the least squares
loss function $L(y, t) = (y - t)^2$ is convex and invariant, but not Lipschitz continuous. The
logistic loss function is a compromise between the other three loss functions: it is twice
continuously differentiable with $L'' > 0$, which is true for the least squares loss function,
and it increases approximately linearly if $|y - t|$ tends to infinity, which is true for
Vapnik's and Huber's loss functions. These four loss functions are even symmetric, because
$L(y, t) = L(t, y)$ for $y, t \in \mathbb{R}$. Asymmetric loss functions may be interesting in some
applications where extremely skewed distributions occur, for example, analysis of claim sizes
in insurance data (see Christmann [4]).

The growth behaviour of L plays an important role in both consistency and robustness 
results. Hence we now introduce some basic concepts that describe the growth behaviour. 

Definition 2. Let $L : Y \times \mathbb{R} \to [0, \infty)$ be a loss function, let $a : Y \to [0, \infty)$ be a measurable
function and let $p \in [0, \infty)$. We say that $L$ is a loss function of type $(a, p)$ if there exists
a constant $c > 0$ such that $L(y, t) \le c(a(y) + |t|^p + 1)$ for all $y \in Y$ and all $t \in \mathbb{R}$. We say
that $L$ is of strong type $(a, p)$ if the first two partial derivatives $L' := \partial_2 L$ and $L'' := \partial_{22} L$
of $L$ with respect to the second argument of $L$ exist, and $L$, $L'$ and $L''$ are of $(a, p)$ type.

For invariant loss functions, it turns out that there is an easy way to determine their 
type. To describe the corresponding results, we need the following definition. 

Definition 3. Let $L$ be an invariant loss function with corresponding function $l : \mathbb{R} \to \mathbb{R}$
and let $p > 0$. We say that $L$ is of upper order $p$ if there exists a constant $c > 0$ such that
$l(r) \le c(|r|^p + 1)$ for all $r \in \mathbb{R}$. Analogously, we say that $L$ is of lower order $p$ if there
exists a constant $c > 0$ such that $l(r) \ge c(|r|^p - 1)$ for all $r \in \mathbb{R}$.

Recalling that convex functions are locally Lipschitz continuous, we see that for invariant
losses $L$, the corresponding $l$ is Lipschitz continuous on every interval $[-r, r]$. Consequently,
$V(r) := |l_{|[-r,r]}|_1$, where $|l_{|[-r,r]}|_1$ denotes the Lipschitz constant of the restriction
$l_{|[-r,r]}$ of $l$ onto $[-r, r]$ for $r \ge 0$, defines a non-decreasing function $V : [0, \infty) \to [0, \infty)$. We
denote its symmetric extension also by $V$, so that we have $V(-r) = V(r)$ for all $r \in \mathbb{R}$.
The next result lists some properties of invariant losses.

Lemma 4. Let $L$ be an invariant loss function with corresponding $l : \mathbb{R} \to \mathbb{R}$ and $p > 0$.

(i) If $L$ is convex and satisfies $\lim_{|r| \to \infty} l(r) = \infty$, then it is of lower order 1.

(ii) If $L$ is Lipschitz continuous, then it is of upper order 1.

(iii) If $L$ is convex, then for all $r > 0$ we have $V(r) \le \frac{2}{r} \|l_{|[-2r,2r]}\|_\infty \le 4V(2r)$.

(iv) If $L$ is of upper order $p$, then $L$ is of type $(a, p)$ with $a(y) := |y|^p$, $y \in Y$.

The proof of this lemma as well as the proofs of all following results can be found in
the Appendix. With the help of Lemma 4, it is easy to see that the least squares loss
function is of strong type $(y^2, 2)$. Furthermore, the logistic loss function is of strong type
$(|y|, 1)$ because it is twice continuously differentiable with respect to its second variable
and both derivatives are bounded, namely $|\partial_2 L(y, t)| \le 1$ and $|\partial_{22} L(y, t)| \le 1/2$, $t \in \mathbb{R}$. The
Huber and Vapnik loss functions are of upper and lower order 1 because they are convex
and Lipschitz continuous; however, they are not of any strong type because they are not
twice continuously differentiable.

Our next goal is to find a condition that ensures $\mathcal{R}_{L,P}(f) < \infty$. To this end we need
the following definition, which for later purposes is formulated in a rather general way
(see Brown and Pearcy [2] for signed measures).

Definition 5. Let $\mu$ be a signed measure on $X \times Y$ with total variation $|\mu|$ and let
$a : Y \to [0, \infty)$ be a measurable function. Then we write $|\mu|_a := \int_{X \times Y} a(y) \, d|\mu|(x, y)$. If
$a(y) = |y|^p$ for some $p > 0$ and all $y \in Y$, we write $|\mu|_p := |\mu|_a$ whenever no confusion can
arise. Finally, we write $|\mu|_0 := \|\mu\|_{\mathcal{M}}$, where $\|\mu\|_{\mathcal{M}}$ denotes the norm of total variation.

We can now formulate the following two results investigating finite risks. 

Proposition 6. Let $L$ be an $(a, p)$-type loss function, let $P$ be a distribution on $X \times Y$
with $|P|_a < \infty$ and let $f : X \to \mathbb{R}$ be a function with $f \in L_p(P)$. Then we have
$\mathcal{R}_{L,P}(f) < \infty$.

Lemma 7. Let $L$ be an invariant loss of lower order $p$, let $f : X \to \mathbb{R}$ be measurable and
let $P$ be a distribution on $X \times Y$ with $\mathcal{R}_{L,P}(f) < \infty$. Then $|P|_p < \infty$ if and only if $f \in L_p(P)$.

If $L$ is an invariant loss function of lower and upper order $p$ and $P$ is a distribution with
$|P|_p = \infty$, Lemma 7 shows $\mathcal{R}_{L,P}(f) = \infty$ for all $f \in L_p(P)$. This suggests that we may even
have $\mathcal{R}_{L,P}(f) = \infty$ for all measurable $f : X \to \mathbb{R}$. However, this is in general not the case.
For example, let $P_X$ be a distribution on $X$ and let $g : X \to \mathbb{R}$ be a measurable function
with $g \notin L_p(P_X)$. Furthermore, let $P$ be the distribution on $X \times \mathbb{R}$ whose marginal
distribution on $X$ is $P_X$ and whose conditional probability satisfies $P(Y = g(x) \mid x) = 1$.
Then we have $|P|_p = \infty$, but $\mathcal{R}_{L,P}(g) = 0$.

Our next goal is to establish some preliminary results on (3). To this end, recall that
the canonical feature map of a kernel $k$ with RKHS $H$ is defined by $\Phi(x) := k(\cdot, x)$,
$x \in X$. Moreover, the reproducing property gives $f(x) = \langle f, k(\cdot, x) \rangle$ for all $f \in H$ and
$x \in X$. Of special importance in terms of applications is the Gaussian radial basis
function (RBF) kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, $\gamma > 0$, which is a universal kernel on
every compact subset of $\mathbb{R}^d$; see Definition 11. This kernel is also bounded because
$\|k\|_\infty := \sup\{\sqrt{k(x, x)} : x \in \mathbb{R}^d\} = 1$. Polynomial kernels $k(x, x') = (c + \langle x, x' \rangle)^m$, $m \ge 1$,
$c \ge 0$, $x, x' \in \mathbb{R}^d$, are also popular in practice, but obviously they are neither universal
nor bounded. Finally, recall that for bounded kernels, the canonical feature map satisfies
$\|\Phi(x)\|_H \le \|k\|_\infty$ for all $x \in \mathbb{R}^d$.
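
The boundedness statements for the two kernel families can be checked numerically through the identity $\|\Phi(x)\|_H = \sqrt{k(x, x)}$, which follows from the reproducing property. The sketch below is illustrative only; the function names and the default values of `gamma`, `c` and `m` are assumptions of this example.

```python
import math


def gaussian_rbf(x, z, gamma=0.1):
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2) for vectors given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def polynomial(x, z, c=1.0, m=2):
    """Polynomial kernel (c + <x, z>)^m."""
    return (c + sum(a * b for a, b in zip(x, z))) ** m


def feature_norm(kernel, x):
    """Norm of the canonical feature map at x, computed as sqrt(k(x, x))."""
    return math.sqrt(kernel(x, x))
```

For the Gaussian RBF kernel `feature_norm` equals 1 at every point, whereas for the polynomial kernel it grows without bound as $\|x\|$ increases, in line with the remark that polynomial kernels are not bounded.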

Let us now recall a result from DeVito et al. [8] that shows that a minimizer $f_{P,\lambda}$ of
(3) exists.

Proposition 8. Let $L$ be a convex loss function of $(a, p)$ type, let $P$ be a distribution on
$X \times Y$ with $|P|_a < \infty$, let $H$ be a RKHS of a bounded kernel $k$ and let $\lambda > 0$. Then there
exists a unique minimizer $f_{P,\lambda} \in H$ of $f \mapsto \mathcal{R}^{reg}_{L,P,\lambda}(f)$ and
$\|f_{P,\lambda}\|_H \le \sqrt{\mathcal{R}_{L,P}(0)/\lambda} =: \delta_{P,\lambda}$.

If $H$ is a RKHS of a bounded kernel and $L$ is a convex, invariant loss of lower and
upper order $p$, then it is easy to see by Lemma 7 that exactly for the distributions $P$ with
$|P|_p < \infty$, the minimizer $f_{P,\lambda}$ is uniquely determined. If $|P|_p = \infty$, we have $\mathcal{R}^{reg}_{L,P,\lambda}(f) = \infty$
for all $f \in H$. Hence, we will use the definition $f_{P,\lambda} := 0$ for such $P$.

Our next aim is to establish a representation of $f_{P,\lambda}$. To this end, we define, for
$p \in [1, \infty]$, the conjugate $p' \in [1, \infty]$ by $1/p + 1/p' = 1$. Furthermore, we have to recall the
notion of subdifferentials; see Phelps [17].

Definition 9. Let $H$ be a Hilbert space, let $F : H \to \mathbb{R} \cup \{\infty\}$ be a convex function and
let $w \in H$ with $F(w) \ne \infty$. Then the subdifferential of $F$ at $w$ is defined by

$$\partial F(w) := \{w^* \in H : \langle w^*, v - w \rangle \le F(v) - F(w) \text{ for all } v \in H\}.$$

With the help of the subdifferential, we can now state the following theorem, which
combines results from Zhang [30], Steinwart [22] and DeVito et al. [8].

Theorem 10. Let $p \ge 1$, let $L$ be a convex loss function of type $(a, p)$ and let $P$ be a
distribution on $X \times Y$ with $|P|_a < \infty$. Let $H$ be the RKHS of a bounded, continuous
kernel $k$ over $X$ and let $\Phi : X \to H$ be the canonical feature map of $H$. Then there exists
an $h \in L_{p'}(P)$ such that $h(x, y) \in \partial_2 L(y, f_{P,\lambda}(x))$ for all $(x, y) \in X \times Y$ and

$$f_{P,\lambda} = -(2\lambda)^{-1} E_P h\Phi, \qquad (5)$$

where $\partial_2 L$ denotes the subdifferential with respect to the second variable of $L$. Moreover,
for all distributions $Q$ on $X \times Y$ with $|Q|_a < \infty$, we have $h \in L_{p'}(P) \cap L_1(Q)$ and

$$\|f_{P,\lambda} - f_{Q,\lambda}\|_H \le \lambda^{-1} \|E_P h\Phi - E_Q h\Phi\|_H, \qquad (6)$$

and if $L$ is an invariant loss of upper order $p$ and $|P|_p < \infty$, then $h \in L_{p'}(P) \cap L_{p'}(Q)$.

3. Consistency of kernel-based regression 

In this section we establish $L$-risk consistency of KBR methods, that is, we show that
$\mathcal{R}_{L,P}(f_{n,\lambda_n}) \to \mathcal{R}^*_{L,P}$ holds in probability for $n \to \infty$ and suitably chosen regularization
sequences $(\lambda_n)$. Of course, such convergence can only hold if the RKHS is rich enough.
One way to describe the richness of $H$ is the following definition taken from Steinwart
[21].

Definition 11. Let $X \subset \mathbb{R}^d$ be compact and let $k$ be a continuous kernel on $X$. We say
that $k$ is universal if its RKHS is dense in the space of continuous functions
$(C(X), \|\cdot\|_\infty)$.

It is well known that many popular kernels including the Gaussian RBF kernels are
universal; see Steinwart [21] for a simple proof of the universality of the latter kernel.
With Definition 11, we can now formulate our consistency result.

Theorem 12. Let $X \subset \mathbb{R}^d$ be compact, let $L$ be an invariant, convex loss of lower and
upper order $p \ge 1$ and let $H$ be a RKHS of a universal kernel on $X$. Define $p^* := \max\{2p, p^2\}$
and fix a sequence $(\lambda_n) \subset (0, \infty)$ with $\lambda_n \to 0$ and $\lambda_n^{p^*} n \to \infty$. Then $f_{n,\lambda_n}$ based on (2),
using $\lambda_n$ for sample sets of length $n$, is $L$-risk consistent for all $P$ with $|P|_p < \infty$.

Note that Theorem 12 in particular shows that KBR using the least squares loss
function is weakly universally consistent in the sense of Györfi et al. [12]. Under the
above assumptions on $L$, $H$ and $(\lambda_n)$, we can even characterize the distributions $P$
for which KBR estimates based on (2) are $L$-risk consistent. Indeed, if $|P|_p = \infty$, then
KBR is trivially $L$-risk consistent for $P$ whenever $\mathcal{R}^*_{L,P} = \infty$. Conversely, if $|P|_p = \infty$
and $\mathcal{R}^*_{L,P} < \infty$, then KBR cannot be $L$-risk consistent for $P$ because Lemma 7 shows
$\mathcal{R}_{L,P}(f) = \infty$ for all $f \in H$.

In some sense it seems natural to consider only consistency for distributions that satisfy
the tail assumption $|P|_p < \infty$, because this was done, for example, in Györfi et al. [12]
for least squares methods. In this sense, Theorem 12 gives consistency for all reasonable
distributions. However, the above characterization shows that our KBR methods are
not robust against small violations of this tail assumption. Indeed, let $P$ and $\tilde{P}$ be two
distributions with $|P|_p < \infty$, $|\tilde{P}|_p = \infty$ and $\mathcal{R}_{L,\tilde{P}}(f^*) < \infty$ for some $f^* \in L_p(P)$. Then
every mixture distribution $Q_\varepsilon := (1 - \varepsilon)P + \varepsilon\tilde{P}$, $\varepsilon \in (0, 1)$, satisfies both $|Q_\varepsilon|_p = \infty$ and
$\mathcal{R}^*_{L,Q_\varepsilon} < \infty$, and thus KBR is not consistent for any of the small perturbations $Q_\varepsilon$ of $P$,
while it is consistent for the original distribution $P$. From a robustness point of view, this is
of course a negative result.

4. Robustness of kernel-based regression 

In the statistical literature, different criteria have been proposed to define the notion
of robustness in a mathematical way. In this paper, we mainly use the approach based
on the influence function proposed by Hampel [13]. We consider a map $T$ that assigns
to every distribution $P$ on a given set $Z$ an element $T(P)$ of a given Banach space $E$.
For the case of the convex risk minimization problem given in (3), we have $E = H$ and
$T(P) = f_{P,\lambda}$. Denote the Dirac distribution at the point $z$ by $\Delta_z$, that is, $\Delta_z(\{z\}) = 1$.

Definition 13 (Influence function). The influence function (IF) of $T$ at a point $z$ for
a distribution $P$ is the special Gâteaux derivative (if it exists)

$$IF(z; T, P) = \lim_{\varepsilon \downarrow 0} \varepsilon^{-1} \{T((1 - \varepsilon)P + \varepsilon\Delta_z) - T(P)\}. \qquad (7)$$

The influence function has the interpretation that it measures the impact of an
infinitesimally small amount of contamination of the original distribution $P$ in the direction
of a Dirac distribution located in the point $z$ on the theoretical quantity of interest $T(P)$.
Hence it is desirable that a statistical method $T(P)$ has a bounded influence function.
We also use the sensitivity curve proposed by Tukey [27].

Definition 14 (Sensitivity curve). The sensitivity curve (SC) of an estimator $T_n$ at a
point $z$ given a data set $z_1, \dots, z_{n-1}$ is defined by

$$SC_n(z; T_n) = n(T_n(z_1, \dots, z_{n-1}, z) - T_{n-1}(z_1, \dots, z_{n-1})).$$

The sensitivity curve measures the impact of a single point $z$ and is a finite sample
version of the influence function. If the estimator $T_n$ is defined via $T(P_n)$, where $P_n$
denotes the empirical distribution that corresponds to $z_1, \dots, z_n$, then we have, for
$\varepsilon_n = 1/n$,

$$SC_n(z; T_n) = (T((1 - \varepsilon_n)P_{n-1} + \varepsilon_n\Delta_z) - T(P_{n-1}))/\varepsilon_n. \qquad (8)$$
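
The definition of the sensitivity curve can be made concrete with scalar estimators. The sketch below is an illustration only and does not involve KBR: it evaluates $SC_n(z; T_n) = n(T_n(z_1, \dots, z_{n-1}, z) - T_{n-1}(z_1, \dots, z_{n-1}))$ for the sample mean and the sample median, using Python's standard `statistics` module; the function name `sensitivity_curve` is an assumption of this example.

```python
import statistics


def sensitivity_curve(estimator, data, z):
    """SC_n(z; T_n) for an estimator applied to data = (z_1, ..., z_{n-1}) plus z."""
    n = len(data) + 1
    return n * (estimator(list(data) + [z]) - estimator(list(data)))


data = [1.0, 2.0, 3.0, 4.0]

# For the mean, the sensitivity curve equals z minus the mean of the original
# data, so it grows without bound in z.
sc_mean = sensitivity_curve(statistics.mean, data, 100.0)

# For the median, the value stays the same once z exceeds the largest data point.
sc_median_far = sensitivity_curve(statistics.median, data, 100.0)
sc_median_farther = sensitivity_curve(statistics.median, data, 1.0e6)
```

The contrast between an unbounded and a bounded sensitivity curve in this toy setting mirrors the distinction drawn later between KBR with a non-Lipschitz loss and KBR with a Lipschitz continuous loss.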

In the following discussion, we give sufficient conditions for the existence of the influence
function for the kernel-based regression methods based on (3). Furthermore, we establish
conditions on the kernel $k$ and on the loss $L$ to ensure that the influence function and
the sensitivity curve are bounded. Let us begin with the following results that ensure the
existence of the influence function for KBR if the loss is convex and twice continuously
differentiable.

Theorem 15. Let $H$ be a RKHS of a bounded continuous kernel $k$ on $X$ with canonical
feature map $\Phi : X \to H$ and let $L : Y \times \mathbb{R} \to [0, \infty)$ be a convex loss function of some
strong type $(a, p)$. Furthermore, let $P$ be a distribution on $X \times Y$ with $|P|_a < \infty$. Then
the influence function of $T(P) := f_{P,\lambda}$ exists for all $z := (x, y) \in X \times Y$ and we have

$$IF(z; T, P) = S^{-1}(E_P(L'(Y, f_{P,\lambda}(X))\Phi(X))) - L'(y, f_{P,\lambda}(x)) S^{-1}\Phi(x), \qquad (9)$$

where $S : H \to H$, $S = 2\lambda \, \mathrm{id}_H + E_P L''(Y, f_{P,\lambda}(X)) \langle \Phi(X), \cdot \rangle \Phi(X)$, is the Hessian of the
regularized risk.

It is worth mentioning that the proof can easily be modified to replace point mass
contaminations $\Delta_z$ by arbitrary contaminations $\tilde{P}$ that satisfy $|\tilde{P}|_a < \infty$. As the discussion
after Theorem 12 shows, we cannot omit this tail assumption on $\tilde{P}$ in general.

From a robustness point of view, one is mainly interested in methods with bounded
influence functions. Interestingly, for some kernel-based regression methods based on (3),
Theorem 15 not only ensures the existence of the influence function, but also indicates
how to guarantee its boundedness. Indeed, (9) shows that the only term of the influence
function that depends on the point mass contamination $\Delta_z$ is

$$-L'(y, f_{P,\lambda}(x)) S^{-1}\Phi(x). \qquad (10)$$

Now recall that $\Phi$ is uniformly bounded because $k$ is assumed to be bounded.
Consequently, the influence function is bounded in $z$ if and only if $-L'(y, f_{P,\lambda}(x))$ is uniformly
bounded in $z = (x, y)$. Obviously, if, in addition, $L$ is invariant and $Y = \mathbb{R}$, then the
latter condition is satisfied if and only if $L$ is Lipschitz continuous. Let us now assume
that we use, for example, a Gaussian kernel on $X = \mathbb{R}^d$. Then the influence function is
bounded if and only if $L$ is Lipschitz continuous. In particular, using the least squares
loss in this scenario leads to a method with an unbounded influence function, while using
the logistic loss function or its asymmetric generalization provides robust methods with
bounded influence functions.
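
The role of the first derivative in this argument can be seen numerically. For an invariant loss, $L'(y, t) = -l'(y - t)$, so it suffices to look at $l'$ as a function of the residual. The following sketch compares the least squares loss $l(r) = r^2$ with the logistic loss; the function names are illustrative, and the closed form $\tanh(r/2)$ for the logistic derivative follows from differentiating $l(r) = -\log\{4\Lambda(r)[1 - \Lambda(r)]\}$.

```python
import math


def least_squares_deriv(r):
    """Derivative of l(r) = r^2 with respect to the residual r."""
    return 2.0 * r


def logistic_deriv(r):
    """Derivative of the logistic loss with respect to r, equal to tanh(r / 2)."""
    return math.tanh(0.5 * r)


def max_abs_on_grid(deriv, radius, points=201):
    """Largest |deriv(r)| over an evenly spaced grid on [-radius, radius]."""
    grid = [-radius + 2.0 * radius * i / (points - 1) for i in range(points)]
    return max(abs(deriv(r)) for r in grid)
```

On $[-100, 100]$ the least squares derivative reaches 200 in absolute value and keeps growing with the radius, while the logistic derivative never exceeds 1, which is the Lipschitz constant of the logistic loss.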

Unfortunately, the above results require a twice continuously differentiable loss and,
therefore, they cannot be used, for example, to investigate methods based on the
$\varepsilon$-insensitive loss or Huber's loss. Our next results, which in particular bound the difference
quotient used in the definition of the influence function, apply to all convex loss functions
of some type $(a, p)$ and hence partially resolve the above problem for non-differentiable
losses.

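Theorem 16. Let $L : Y \times \mathbb{R} \to [0, \infty)$ be a convex loss of some type $(a, p)$, and let $P$ and
$\tilde{P}$ be distributions on $X \times Y$ with $|P|_a < \infty$ and $|\tilde{P}|_a < \infty$. Furthermore, let $H$ be a RKHS
of a bounded, continuous kernel on $X$. Then for all $\lambda > 0$, $\varepsilon > 0$, we have

$$\|f_{(1-\varepsilon)P+\varepsilon\tilde{P},\lambda} - f_{P,\lambda}\|_H \le 2c[\lambda\delta_{P,\lambda}]^{-1} \varepsilon (|P|_a + |\tilde{P}|_a + 2^{p+1} \delta_{P,\lambda}^p \|k\|_\infty^p + 2),$$

where $c$ is the constant of the type $(a, p)$ inequality and $\delta_{P,\lambda} = \sqrt{\mathcal{R}_{L,P}(0)/\lambda}$.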

For the special case $\tilde{P} = \Delta_z$ with $z = (x, y)$, we have $|\Delta_z|_a = a(y)$ and hence we obtain
bounds for the difference quotient that occurs in the definition of the influence
function if we divide the bound by $\varepsilon$. Unfortunately, it then turns out that we can almost
never bound the difference quotient uniformly in $z$ by the above result. The reason for
this problem is that the $(a, p)$ type is a rather loose concept for describing the growth
behaviour of loss functions. However, if we consider only invariant loss functions (and
many loss functions used in practice are invariant), we are able to obtain stronger results.

Theorem 17. Let $L : Y \times \mathbb{R} \to [0, \infty)$ be a convex invariant loss of upper order $p \ge 1$,
and let $P$ and $\tilde{P}$ be distributions on $X \times Y$ with $|P|_p < \infty$ and $|\tilde{P}|_p < \infty$. Furthermore,
let $H$ be a RKHS of a bounded, continuous kernel on $X$. Then for all $\lambda > 0$, $\varepsilon > 0$, we
have

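$$\|f_{(1-\varepsilon)P+\varepsilon\tilde{P},\lambda} - f_{P,\lambda}\|_H \le c\lambda^{-1} \|k\|_\infty \varepsilon \big( |P - \tilde{P}|_{p-1} + |P - \tilde{P}|_0 (\|k\|_\infty^{p-1} |P|_p^{(p-1)/2} \lambda^{(1-p)/2} + 1) \big),$$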

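where the constant $c$ only depends on $L$ and $p$. Moreover, if in addition $L$ is Lipschitz
continuous, then for all $\lambda > 0$, $\varepsilon > 0$, we have

$$\|f_{(1-\varepsilon)P+\varepsilon\tilde{P},\lambda} - f_{P,\lambda}\|_H \le \lambda^{-1} \|k\|_\infty |L|_1 \|P - \tilde{P}\|_{\mathcal{M}} \, \varepsilon.$$

In particular, considering (2), we have $\|SC_n(z; T_n)\|_H \le 2\lambda^{-1} \|k\|_\infty |L|_1$ for all $z \in X \times Y$.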
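
As a worked instance of the sensitivity curve bound $2\lambda^{-1} \|k\|_\infty |L|_1$ for Lipschitz continuous losses, the sketch below evaluates it for a Gaussian RBF kernel ($\|k\|_\infty = 1$) and the $\varepsilon$-insensitive loss (Lipschitz constant 1). The value $\lambda = 0.05$ is only an example value, and the function name `sc_bound` is hypothetical.

```python
def sc_bound(lam, kernel_sup_norm, lipschitz_const):
    """Upper bound 2 * ||k||_inf * |L|_1 / lambda on the H-norm of the sensitivity curve."""
    if lam <= 0:
        raise ValueError("the regularization parameter must be positive")
    return 2.0 * kernel_sup_norm * lipschitz_const / lam


# Gaussian RBF kernel and epsilon-insensitive loss with lambda = 0.05.
bound_eps_insensitive = sc_bound(0.05, 1.0, 1.0)

# Huber's loss with threshold c has Lipschitz constant c; here c = 2 as an example.
bound_huber = sc_bound(0.05, 1.0, 2.0)
```

The bound does not depend on the contaminating point $z$ or on the sample size, and it deteriorates as $\lambda$ decreases.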
Finally, let us compare the influence function of kernel-based regression methods with
the influence function of M estimators in linear regression models with $f(x_i) = x_i'\theta$,
where $\theta \in \mathbb{R}^d$ denotes the unknown parameter vector. Let us assume for reasons of
simplicity that the scale parameter $\sigma \in (0, \infty)$ of the linear regression model is known. For
more details about such M estimators, see Hampel et al. [14]. The functional $T(P)$ that
corresponds to an M estimator is the solution of

$$E_P \, \eta(X, (Y - X'T(P))/\sigma) X = 0, \qquad (11)$$

where the odd function $\eta(x, \cdot)$ is continuous for $x \in \mathbb{R}^d$ and $\eta(x, u) \ge 0$ for all $x \in \mathbb{R}^d$,
$u \in [0, \infty)$. Almost all proposals of $\eta$ may be written in the form $\eta(x, u) = \psi(v(x) \cdot u) \cdot w(x)$,
where $\psi : \mathbb{R} \to \mathbb{R}$ is a suitable user-defined function (often continuous, bounded and
increasing), and $w : \mathbb{R}^d \to [0, \infty)$ and $v : \mathbb{R}^d \to [0, \infty)$ are weight functions. An important
subclass of M estimators is of Mallows type, that is, $\eta(x, u) = \psi(u) \cdot w(x)$. The influence
function of $T(P) = \theta$ in the point $z = (x, y)$ at a distribution $P$ for $(X, Y)$ on $\mathbb{R}^d \times \mathbb{R}$ is
given by

$$IF(z; T, P) = M^{-1}(\eta, P) \cdot \eta(x, (y - x'T(P))/\sigma) \cdot x, \qquad (12)$$

where $M(\eta, P) := E_P \, \eta'(X, (Y - X'T(P))/\sigma) XX'$. An important difference between
kernel-based regression and M estimation is that $IF(z; T, P) \in \mathbb{R}^d$ in (12), but
$IF(z; T, P) \in H$ in (9) for point mass contamination in the point $z$.

A comparison of the influence function of KBR given in (9) with the influence function
of M estimators given in (12) yields that both influence functions have, nevertheless,
a similar structure. The function $S = S(L'', P, k)$ for KBR and the matrix $M(\eta, P)$ for
M estimation do not depend on $z$. The terms in the influence functions that depend on
$z = (x, y)$, where the point mass contamination $\Delta_z$ occurs, are a product of two factors.
The first factors are $-L'(y, f_{P,\lambda}(x))$ for general KBR, $\psi(v(x) \cdot (y - x'\theta)/\sigma)$ for general M
estimation, $l'(y - f_{P,\lambda}(x))$ for KBR with an invariant loss function and $\psi((y - x'\theta)/\sigma)$ for
M estimation of Mallows type. Hence the first factors measure the outlyingness in the $y$
direction. The KBR with an invariant loss function and M estimators of Mallows type
use first factors that depend only on the residuals. The second factors are $S^{-1}\Phi(x)$ for
the kernel-based methods and $w(x)x$ for M estimation. Therefore, they do not depend
on $y$ and they measure the outlyingness in the $x$ direction.

In conclusion, one can say that there is a natural connection between KBR estimation
and M estimation in the sense of the influence function approach. The main difference
between the influence functions is, of course, that the map $S^{-1}\Phi(x)$ takes values in the
RKHS $H$ in the case of KBR, whereas $w(x)x \in \mathbb{R}^d$ for M estimation.

5. Examples

In this section we give simple numerical examples to show the following concepts:

• The KBR with the $\varepsilon$-insensitive loss function is indeed more robust than the KBR
based on the least squares loss function if there are outliers in the $y$ direction.

• In general, there is no hope to obtain robust predictions $f(x)$ with KBR if $x$ belongs
to a subset of the design space $X$ where no or almost no data points are in the
training data set, that is, if $x$ is a leverage point.

We constructed a data set with $n = 101$ points in the following way. There is one
explanatory variable $x_i$ with values from $-5$ to $5$ in steps of order 0.1. The responses $y_i$ are
simulated by $y_i = x_i + e_i$, where $e_i$ is a random number from a normal distribution with
expectation 0 and variance 1. As hyperparameters, we used $(\varepsilon, \gamma, \lambda) = (0.1, 0.1, 0.05)$ for
the $\varepsilon$-insensitive loss function with a RBF kernel, $(\varepsilon, \lambda) = (0.1, 0.05)$ for the $\varepsilon$-insensitive
loss function with a linear kernel and $(\gamma, \lambda) = (0.1, 0.05)$ for the least squares loss function
with an RBF kernel.
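
The construction of the simulated data set can be sketched as follows. The random seed, the random number generator and the function names are assumptions of this sketch, so the resulting numbers will not reproduce the authors' data. The helper `move_point` mimics the outlier scenario in which one response is moved to $y = 100$; which of the original points is moved is not stated in the text, so the sketch assumes it is the one at $x = -2$.

```python
import random


def simulate_data(seed=0):
    """Return x on the grid -5, -4.9, ..., 5 and y = x + e with e ~ N(0, 1)."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for reproducibility
    x = [round(-5.0 + 0.1 * i, 1) for i in range(101)]
    y = [xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y


def move_point(y, index, new_y):
    """Return a copy of y in which the response at the given index is replaced."""
    y_new = list(y)
    y_new[index] = new_y
    return y_new


x, y = simulate_data()

# Index 30 corresponds to x = -2 on the grid; its response is moved to 100.
y_outlier = move_point(y, x.index(-2.0), 100.0)
```

Fitting any of the KBR variants to `(x, y)` and to `(x, y_outlier)` then allows a side-by-side comparison of the fitted curves.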

The $\varepsilon$-insensitive support vector regression ($\varepsilon$-SVR) and least squares support vector
regression (LS-SVR) with similar hyperparameters give almost the same fitted curves; see
Figure 1(a). However, Figure 1(b) shows that $\varepsilon$-SVR is much less influenced by outliers
in the $y$ direction (one data point is moved to $(x, y) = (-2, 100)$) than is LS-SVR (cf.
Wahba [29]) due to the different behaviour of the first derivative of the losses.

Now we add sequentially to the original data set three samples all equal to $(x, y) = (100, 0)$,
which are bad leverage points with respect to a linear regression model. The number
of such samples has a large impact on KBR with a linear kernel, but the predictions
of KBR with a Gaussian RBF kernel are stable (but nonlinear); see Figure 1(c).

Now we study the impact of adding to the original data set two data points $z_1 =
(100, 100)$ and $z_2 = (100, 0)$ on the predictions of KBR; see Figure 1(d). By construction,
$z_1$ is a good leverage point and $z_2$ is a bad leverage point with respect to a linear regression
model which follows, for example, by computing the highly robust least trimmed squares
(LTS) estimator (Rousseeuw [19]), whereas the roles of these data points are switched for
a quadratic model. There is no regression model which can fit all data points well because
the $x$ components of $z_1$ and $z_2$ are equal by construction. We used the $\varepsilon$-insensitive loss
function with a RBF kernel with hyperparameters $(\varepsilon, \gamma, \lambda) = (0.1, 0.1, 0.05)$ for the curve
RBF, a and $(\varepsilon, \gamma, \lambda) = (0.1, 0.00001, 0.00005)$ for the curve RBF, b. This toy example
shows that, in general, one cannot hope to obtain robust predictions $f(x)$ for
$E_P(Y \mid X = x)$ with KBR if $x$ belongs to a subset of $X$ where no or almost no data points are in
the training data set, because the addition of a single data point can have a big impact
on KBR if the RKHS $H$ is rich enough. Note that the hyperparameters $\varepsilon$, $\gamma$ and $\lambda$ were
specified in these examples to illustrate certain aspects of KBR and, hence, were not
determined by a grid search or by cross-validation.

Figure 1. Results for simulated data sets without or with artificial outliers. (a) Linear
relationship, no outliers: $\varepsilon$-insensitive loss function $L_\varepsilon$ (solid) and the least squares loss function
$L_{LS}$ (dashed) both with a RBF kernel give almost the same results. (b) Linear relationship, one
outlier in the $y$ direction at $(x, y) = (-2, 100)$: KBR with a RBF kernel performs more robustly
if $L_\varepsilon$ (solid) is used instead of $L_{LS}$ (dashed). (c) Linear relationship with additional 1, 2 or 3
extreme points in $(x, y) = (100, 0)$: KBR using $L_\varepsilon$ with a linear kernel (dashed) and an RBF
kernel (solid). (d) Linear relationship with two additional data points in $(x, y) = (100, 0)$ and
$(x, y) = (100, 100)$: KBR with a linear kernel (dashed) and two curves based on RBF kernel
(solid) with different values of $\lambda$ and $\gamma$.

6. Discussion

In this paper, properties of kernel-based regression methods including support vector
machines were investigated. Consistency of kernel-based regression methods was derived
and results for the influence function, its difference quotient and the sensitivity curve
were established. Our theoretical results show that KBR methods using a loss function
with bounded first derivative (e.g., logistic loss) in combination with a bounded and
rich enough continuous kernel (e.g., a Gaussian RBF kernel) are not only consistent and
computationally tractable, but also offer attractive robustness properties.

Most of our results have analogues in the theory of kernel-based classification methods; see, for example, Christmann and Steinwart [5] and Steinwart [23]. However, because in the classification scenario $Y$ is only $\{-1, 1\}$-valued, many effects of the regression scenario with unbounded $Y$ do not occur in the above papers. Consequently, we had to develop a variety of new techniques and concepts: one central issue here was to find notions for loss functions which, on the one hand, are mild enough to cover a wide range of reasonable loss functions and, on the other hand, are strong enough to allow meaningful results for both consistency and robustness under minimal conditions on $Y$. In our analysis it turned out that the relationship between the growth behavior of the loss function and the tail behavior of $Y$ plays a central role for both types of results. Interestingly, similar tail properties of $Y$ are widely used to obtain consistency of nonparametric regression estimators and to establish robustness properties of M estimators in linear regression. For example, Györfi et al. [12] assumed $E_P Y^2 < \infty$ for the least squares loss, Hampel et al. ([14], page 315) assumed existence and non-singularity of $E_P\, \eta'(X, (Y - X^\top T(P))/\sigma)\, X X^\top$ and Davies ([7], page 1876) assumed $E_P \|X\| (\|X\| + |Y|) < \infty$. Another important issue was to deal with the estimation error in the consistency analysis. We decided to use a stability approach to avoid truncation techniques, so the proof of our consistency result became surprisingly short. An additional benefit of this approach was that it revealed an interesting connection between the robustness and the consistency of KBR methods. A somewhat similar observation was recently made by Poggio et al. [18] for a wide class of learning algorithms. However, they assumed that the loss or $Y$ is bounded and hence their results cannot be used in our more general setting.

Our result concerning the influence function of kernel-based regression (Theorem 15) is valid under the assumption that the loss function is twice continuously differentiable, whereas our other robustness results are valid for more general loss functions. The strong differentiability assumption was made because our proof is based on a classical implicit function theorem. We have not investigated whether similar results hold true for continuous but not differentiable loss functions. It may be possible to obtain such results by using an implicit function theorem for non-smooth functions based on a weaker concept than Fréchet differentiability. However, there are indications why a smooth loss function may even be desirable. The function $-L'$ has a role for kernel-based regression similar to the $\psi$ function for M estimators. Huber ([16], page 51) considered robust estimation in parametric models and investigated the case that the underlying distribution is a mixture of a smooth distribution and a point mass. He showed that an M estimator has a non-normal limiting behaviour if the point mass is at a discontinuity of the derivative of the score function. Because distributions with point masses are not excluded by nonparametric regression methods such as KBR, his results indicate that a twice continuously differentiable loss may guard against such phenomena.
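
To make the role of the derivative of the loss concrete, the following sketch tabulates $l'$ for four invariant losses $L(y, t) = l(y - t)$ and compares one-sided slopes of $l'$ at a kink. It is an illustration only: the function names are hypothetical, the Huber threshold `c` and the value of `eps` are arbitrary, and the logistic loss is assumed to be $l(r) = -\log(4\Lambda(r)(1 - \Lambda(r)))$ with $\Lambda(r) = 1/(1 + e^{-r})$, for which $l'(r) = \tanh(r/2)$.

```python
import math

def lsq_psi(r):
    """l(r) = r^2, so l'(r) = 2r: unbounded in the residual."""
    return 2.0 * r

def huber_psi(r, c=1.0):
    """Huber loss with threshold c: l'(r) is r clipped to [-c, c]."""
    return max(-c, min(c, r))

def eps_insensitive_psi(r, eps=0.1):
    """One subgradient of l(r) = max(0, |r| - eps)."""
    if r > eps:
        return 1.0
    if r < -eps:
        return -1.0
    return 0.0

def logistic_psi(r):
    """Assumed logistic loss l(r) = -log(4*Lambda(r)*(1-Lambda(r))): l'(r) = tanh(r/2)."""
    return math.tanh(r / 2.0)

def one_sided_slopes(psi, r, h=1e-4):
    """Left and right difference quotients of psi at r."""
    left = (psi(r) - psi(r - h)) / h
    right = (psi(r + h) - psi(r)) / h
    return left, right

# Bounded versus unbounded first derivative for a huge residual.
big = 1e6
derivatives_at_big = {
    "least squares": lsq_psi(big),
    "Huber": huber_psi(big),
    "eps-insensitive": eps_insensitive_psi(big),
    "logistic": logistic_psi(big),
}

# Smoothness of l' at r = 1: Huber's l' has a kink there (for c = 1), the logistic l' does not.
huber_left, huber_right = one_sided_slopes(huber_psi, 1.0)
logit_left, logit_right = one_sided_slopes(logistic_psi, 1.0)
```

Only the logistic loss in this list has a bounded and continuously differentiable $l'$; the Huber and $\varepsilon$-insensitive losses have bounded $l'$ but fail the twice-differentiability assumption of Theorem 15 at their kinks.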

To the best of our knowledge there are no results on robustness properties of KBR which are comparable to those presented here. However, we would like to refer to Schölkopf and Smola [20], who already gave arguments for better robustness properties of KBR when Huber's loss function is used instead of the least squares loss function.

Theorems 15 to 17 and the comments after Theorem 12 show that KBR estimators based on appropriate choices of $L$ and $k$ have a bounded influence function if the distribution $P$ has the tail property $|P|_a < \infty$, but are non-robust against small violations of this tail assumption. The deeper reason for this instability is that the theoretical regularized risk itself is defined via $E_P L(Y, f(X))$, which is a non-robust location estimator for the distribution of the losses. This location estimator can be infinite for mixture distributions $(1 - \varepsilon)P + \varepsilon\tilde P$ no matter how small $\varepsilon > 0$ is. Following general rules of robust estimation in linear regression models, one might replace this non-robust location estimator by a robust alternative like an $\alpha$-trimmed mean or the median (Rousseeuw [19]), which results in the kernel-based least median of squares estimator $f^*_{P,\lambda} = \arg\min_{f \in H} \mathrm{Median}_P\, L(Y, f(X)) + \lambda \|f\|_H^2$. We conjecture that $f^*_{P,\lambda}$ offers additional robustness, but sacrifices computational efficiency. However, such methods are beyond the scope of this paper.
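
The difference between the mean and robust location estimators of the losses can be seen on a tiny example. This is a sketch with made-up loss values, not data from the paper; `trimmed_mean` is a hypothetical helper that drops the $\lfloor \alpha n \rfloor$ smallest and largest values.

```python
import statistics

def trimmed_mean(values, alpha):
    """alpha-trimmed mean: drop floor(alpha * n) values at each end, average the rest."""
    v = sorted(values)
    k = int(alpha * len(v))
    core = v[k:len(v) - k]
    return sum(core) / len(core)

# Hypothetical losses L(y_i, f(x_i)) for 20 well-behaved observations.
losses = [0.1 * i for i in range(1, 21)]
# One contaminating observation with an enormous loss.
contaminated = losses + [1e12]

mean_clean = statistics.mean(losses)
mean_contam = statistics.mean(contaminated)
median_contam = statistics.median(contaminated)
trimmed_contam = trimmed_mean(contaminated, 0.1)

print("mean:        ", mean_clean, "->", mean_contam)
print("median:      ", statistics.median(losses), "->", median_contam)
print("trimmed mean:", trimmed_mean(losses, 0.1), "->", trimmed_contam)
```

A single extreme loss moves the empirical mean arbitrarily far, while the median and the trimmed mean stay near the bulk of the losses. The price, as noted above, is that minimizing such a criterion over $f$ is no longer a convex problem in general.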

Appendix: Proofs of the results 

Proof of Lemma 4. The left inequality of (iii) is well known from convex analysis and the right inequality of (iii) easily follows from $l(r) = |l(r) - l(0)| \le |r|\, V(r)$ for all $r \in \mathbb{R}$. Moreover, the right inequality of (iii) directly implies (ii). Furthermore, (i) can be easily obtained by the left inequality of (iii) because $\lim_{|r| \to \infty} l(r) = \infty$ implies $V(r) > 0$ for all $r \ne 0$ and the convexity of $l$ ensures that $r \mapsto l(|r|)$ is monotone. Finally, the last assertion follows from $L(y, t) = l(y - t) \le c(|y - t|^p + 1) \le \hat c(|y|^p + |t|^p + 1)$. □

Proof of Proposition 6. For bounded measurable functions $f : X \to \mathbb{R}$, we have

$$\mathcal{R}_{L,P}(f) \le c\, E_P\bigl(a(Y) + |f(X)|^p + 1\bigr) \le c\|a\|_{L_1(P)} + c\|f\|^p_{L_p(P)} + c < \infty. \qquad □$$

Proof of Lemma 7. For all $a, b \in \mathbb{R}$, we have $(|a| + |b|)^p \le 2^{p-1}(|a|^p + |b|^p)$ if $p \ge 1$ and $(|a| + |b|)^p \le |a|^p + |b|^p$ otherwise. This obviously implies $|a|^p \le 2^{p-1}(|a - b|^p + |b|^p)$ and $|a|^p \le |a - b|^p + |b|^p$, respectively. If $f \in L_p(P)$, we consequently obtain

$$c\, E_P\bigl(|Y|^p - c_p |f(X)|^p - 1\bigr) \le c\, E_P\bigl(|Y - f(X)|^p - 1\bigr) \le \mathcal{R}_{L,P}(f) < \infty$$

for some finite constants $c > 0$ and $c_p > 0$. From this we immediately obtain $|P|_p < \infty$. The converse implication can be shown analogously. □
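
For convenience, the two elementary inequalities used at the start of this proof can be verified as follows. This is a short sketch added for the reader and not part of the original argument; it assumes $0 < p < 1$ in the second case and uses only convexity of $t \mapsto t^p$ for $p \ge 1$.

```latex
% Case p >= 1: t -> t^p is convex on [0, \infty), so Jensen's inequality gives
\left(\frac{|a|+|b|}{2}\right)^{p} \le \frac{|a|^{p}+|b|^{p}}{2}
\quad\Longrightarrow\quad
(|a|+|b|)^{p} \le 2^{p-1}\bigl(|a|^{p}+|b|^{p}\bigr).

% Case 0 < p < 1: put s = |a|+|b| > 0. The weights |a|/s and |b|/s lie in [0,1],
% and u^p >= u for u in [0,1], hence
1 = \frac{|a|}{s}+\frac{|b|}{s}
  \le \left(\frac{|a|}{s}\right)^{p}+\left(\frac{|b|}{s}\right)^{p}
\quad\Longrightarrow\quad
(|a|+|b|)^{p} \le |a|^{p}+|b|^{p}.

% Applying either case to the pair (a-b, b) and using |a| <= |a-b| + |b|
% yields the corresponding bound on |a|^p.
```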

Proof of Proposition 8. The first two assertions follow from Proposition 8 of De Vito et al. [8] and the last assertion is trivial. □

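Proof of Theorem 10. The existence of $h$ and (5) have already been shown by De Vito et al. [8]. Moreover, for $(x, y) \in X \times Y$ and $K_\lambda := \delta_{P,\lambda}\|k\|_\infty$, we have

$$|h(x, y)| \le \bigl|L(y, \cdot)_{|[-|f_{P,\lambda}(x)|,\, |f_{P,\lambda}(x)|]}\bigr|_1 \le 2K_\lambda^{-1} \bigl\|L(y, \cdot)_{|[-4K_\lambda,\, 4K_\lambda]}\bigr\|_\infty \le 2cK_\lambda^{-1}\bigl(a(y) + |4K_\lambda|^p + 1\bigr), \tag{13}$$

from which we deduce $h \in L_1(Q)$. Furthermore, by the definition of the subdifferential, we have $h(x, y)(f_{Q,\lambda}(x) - f_{P,\lambda}(x)) \le L(y, f_{Q,\lambda}(x)) - L(y, f_{P,\lambda}(x))$ and hence

$$E_{(X,Y) \sim Q} L(Y, f_{P,\lambda}(X)) + \langle f_{Q,\lambda} - f_{P,\lambda}, E_Q h\Phi \rangle \le E_{(X,Y) \sim Q} L(Y, f_{Q,\lambda}(X)). \tag{14}$$

Moreover, an easy calculation shows

$$\lambda\|f_{P,\lambda}\|_H^2 + \langle f_{Q,\lambda} - f_{P,\lambda}, 2\lambda f_{P,\lambda} \rangle + \lambda\|f_{P,\lambda} - f_{Q,\lambda}\|_H^2 = \lambda\|f_{Q,\lambda}\|_H^2. \tag{15}$$

Combining (14), (15), $2\lambda f_{P,\lambda} = -E_P h\Phi$ and $\mathcal{R}^{reg}_{L,Q,\lambda}(f_{Q,\lambda}) \le \mathcal{R}^{reg}_{L,Q,\lambda}(f_{P,\lambda})$, it follows that

$$\mathcal{R}^{reg}_{L,Q,\lambda}(f_{P,\lambda}) + \langle f_{Q,\lambda} - f_{P,\lambda}, E_Q h\Phi - E_P h\Phi \rangle + \lambda\|f_{P,\lambda} - f_{Q,\lambda}\|_H^2 \le \mathcal{R}^{reg}_{L,Q,\lambda}(f_{P,\lambda}).$$

Hence we obtain $\lambda\|f_{P,\lambda} - f_{Q,\lambda}\|_H^2 \le \langle f_{P,\lambda} - f_{Q,\lambda}, E_Q h\Phi - E_P h\Phi \rangle \le \|f_{P,\lambda} - f_{Q,\lambda}\|_H \cdot \|E_Q h\Phi - E_P h\Phi\|_H$, which shows the assertion in the general case.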

Now let us assume that $L$ is invariant. As usual, we denote the function that represents $L$ by $l : \mathbb{R} \to [0, \infty)$. Then we easily check that $h$ satisfies $h(x, y) \in -\partial l(y - f_{P,\lambda}(x))$ for all $(x, y) \in X \times Y$. Now for $p = 1$ we see by (iv) of Lemma 4 that $l$ is Lipschitz continuous and hence the function $V$ is constant. Using $|h(x, y)| \le V(y - f_{P,\lambda}(x))$ (cf. Phelps [17], Proposition 1.11) we thus find $h \in L_\infty(Q)$, which is the assertion for $p = 1$. Therefore, let us finally consider the case $p > 1$. Then for $(x, y) \in X \times Y$ with $r := |y - f_{P,\lambda}(x)| \ge 1$, we have $|h(x, y)| \le |\partial l(y - f_{P,\lambda}(x))| \le V(r) \le \frac{2}{r}\|l_{|[-2r, 2r]}\|_\infty \le c\, r^{p-1}$ for a suitable constant $c > 0$. Furthermore, for $(x, y) \in X \times Y$ with $|y - f_{P,\lambda}(x)| \le 1$, we have $|h(x, y)| \le |\partial l(y - f_{P,\lambda}(x))| \le V(y - f_{P,\lambda}(x)) \le V(1)$. Together, these estimates show $|h(x, y)| \le \hat c \max\{1, |y - f_{P,\lambda}(x)|^{p-1}\} \le \hat c\, c_p (1 + |y|^{p-1} + |f_{P,\lambda}(x)|^{p-1})$ for some constant $\hat c$ depending only on $L$ and $c_p := \max\{1, 2^{p-2}\}$. Now, using $p'(p - 1) = p$, we obtain

$$\|h\|_{L_{p'}(Q)} \le \hat c\, c_p \bigl(|Q|_p^{p-1} + \|k\|_\infty^{p-1}\|f_{P,\lambda}\|_H^{p-1} + 1\bigr). \tag{16}$$

Finally, for later purposes we note that our previous considerations for $p = 1$ showed that (16) also holds in this case. □

To prove Theorem 12, we need the following preliminary results. 

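Lemma 18. Suppose that the minimizer $f_{P,\lambda}$ of (3) exists for all $\lambda > 0$. Then we have

$$\lim_{\lambda \to 0} \mathcal{R}^{reg}_{L,P,\lambda}(f_{P,\lambda}) = \inf_{f \in H} \mathcal{R}_{L,P}(f) =: \mathcal{R}^*_{L,P,H}.$$

Proof. Let $\varepsilon > 0$ and $f_\varepsilon \in H$ with $\mathcal{R}_{L,P}(f_\varepsilon) \le \mathcal{R}^*_{L,P,H} + \varepsilon$. Then for all $\lambda < \varepsilon\|f_\varepsilon\|_H^{-2}$, we have $\mathcal{R}^*_{L,P,H} \le \lambda\|f_{P,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda}) \le \lambda\|f_\varepsilon\|_H^2 + \mathcal{R}_{L,P}(f_\varepsilon) \le 2\varepsilon + \mathcal{R}^*_{L,P,H}$. □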

Lemma 19. Let $L$ be a convex and invariant loss function of lower and upper order $p \ge 1$ and let $H$ be a RKHS of a universal kernel. Then for all distributions $P$ on $X \times Y$ with $|P|_p < \infty$, we have $\mathcal{R}^*_{L,P,H} = \mathcal{R}^*_{L,P}$.

Proof. Follows from Corollary 1 of Steinwart et al. [25]. □

Lemma 20. Let $L$ be a convex invariant loss function of some type $p \ge 1$ and let $P$ be a distribution on $X \times Y$ with $|P|_p < \infty$. Then there exists a constant $c_p > 0$ that depends only on $L$ and $p$ such that for all bounded measurable functions $f, g : X \to \mathbb{R}$ we have

$$|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,P}(g)| \le c_p\bigl(|P|_{p-1}^{p-1} + \|f\|_\infty^{p-1} + \|g\|_\infty^{p-1} + 1\bigr)\|f - g\|_\infty.$$

Proof. Again we have $V(|y| + |a|) \le c_p(|y|^{p-1} + |a|^{p-1} + 1)$ for all $a \in \mathbb{R}$, $y \in Y$, and a suitable constant $c_p > 0$ that depends on $L$ and $p$. Furthermore, we find

$$|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,P}(g)| \le E_P\bigl|l(Y - f(X)) - l(Y - g(X))\bigr| \le E_P\, V\bigl(|Y| + \|f\|_\infty + \|g\|_\infty\bigr)\,|f(X) - g(X)|.$$

Now we easily obtain the assertion by combining both estimates. □



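Lemma 21. Let $Z$ be a measurable space, let $P$ be a distribution on $Z$, let $H$ be a Hilbert space and let $g : Z \to H$ be a measurable function with $\|g\|_q := (E_P\|g\|_H^q)^{1/q} < \infty$ for some $q \in (1, \infty)$. We write $q^* := \min\{1/2, 1/q'\}$. Then there exists a universal constant $c_q > 0$ such that, for all $\varepsilon > 0$ and all $n \ge 1$, we have

$$P^n\left((z_1, \ldots, z_n) \in Z^n : \left\|\frac{1}{n}\sum_{i=1}^n g(z_i) - E_P g\right\|_H \ge \varepsilon\right) \le c_q\left(\frac{\|g\|_q}{\varepsilon\, n^{q^*}}\right)^q.$$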

For the proof of Lemma 21, we have to recall some basics from local Banach space theory. To this end, we call a sequence of independent, symmetric $\{-1, +1\}$-valued random variables $(\varepsilon_i)$ a Rademacher sequence. Now let $E$ be a Banach space, let $(X_i)$ be an i.i.d. sequence of $E$-valued, centered random variables and let $(\varepsilon_i)$ be a Rademacher sequence which is independent of $(X_i)$. The distribution of $\varepsilon_i$ is denoted by $\nu$. Using Hoffmann-Jørgensen ([15], Corollary 4.2), we have for all $1 \le p < \infty$ and all $n \ge 1$ that

$$E_{P^n}\left\|\sum_{i=1}^n X_i\right\|^p \le 2^p\, E_{P^n} E_{\nu^n}\left\|\sum_{i=1}^n \varepsilon_i X_i\right\|^p, \tag{17}$$

where the left expectation is with respect to the distribution $P^n$ of $X_1, \ldots, X_n$, whereas the right expectation is also with respect to the distribution $\nu^n$ of $\varepsilon_1, \ldots, \varepsilon_n$. Furthermore, a Banach space $E$ is said to have type $p$, $1 \le p \le 2$, if there exists a constant $c_p(E) > 0$ such that for all $n \ge 1$ and all finite sequences $x_1, \ldots, x_n \in E$ we have $E_{\nu^n}\|\sum_{i=1}^n \varepsilon_i x_i\|^p \le c_p(E) \sum_{i=1}^n \|x_i\|^p$. In the following, because we are only interested in Hilbert spaces $H$, we note that these spaces always have type 2 with constant $c_2(H) = 1$ by orthogonality. Furthermore, they also have type $p$ for all $1 \le p < 2$ by Kahane's inequality (see, e.g., Diestel et al. [10], page 211), which ensures $(E_{\nu^n}\|\sum_{i=1}^n \varepsilon_i x_i\|^p)^{1/p} \le c_{p,q}(E_{\nu^n}\|\sum_{i=1}^n \varepsilon_i x_i\|^q)^{1/q}$ for all $p, q \in (0, \infty)$, $n \ge 1$, all Banach spaces $E$, all $x_1, \ldots, x_n \in E$ and constants $c_{p,q}$ only depending on $p$ and $q$.
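
The statement that Hilbert spaces have type 2 with constant 1 amounts to the identity $E_{\nu^n}\|\sum_i \varepsilon_i x_i\|^2 = \sum_i \|x_i\|^2$, because the cross terms $\langle x_i, x_j \rangle E\,\varepsilon_i\varepsilon_j$ vanish for $i \ne j$. The sketch below checks this by exact enumeration of all sign patterns in $\mathbb{R}^2$; the vectors are arbitrary illustrative choices and the function name is hypothetical.

```python
import itertools

def rademacher_second_moment(vectors):
    """Exact E || sum_i eps_i x_i ||^2, averaging over all 2^n sign patterns."""
    n = len(vectors)
    dim = len(vectors[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        s = [sum(e * v[d] for e, v in zip(signs, vectors)) for d in range(dim)]
        total += sum(c * c for c in s)
    return total / 2 ** n

# Arbitrary vectors in the Hilbert space R^2 (illustrative values).
xs = [(1.0, 2.0), (-3.0, 0.5), (0.0, 4.0), (2.5, -1.0)]

lhs = rademacher_second_moment(xs)          # E || sum eps_i x_i ||^2
rhs = sum(a * a + b * b for a, b in xs)     # sum ||x_i||^2

print(lhs, rhs)
```

Both quantities coincide, which is the equality case of the type 2 inequality with $c_2(H) = 1$.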

Proof of Lemma 21. The summations used in this proof are with respect to $i \in \{1, \ldots, n\}$. Define $h(z_1, \ldots, z_n) := \frac{1}{n}\sum g(z_i) - E_P g$. A standard calculation shows $P^n(\|h\|_H \ge \varepsilon) \le \varepsilon^{-q} E_{P^n}\|h\|_H^q$; hence it remains to estimate $E_{P^n}\|h\|_H^q$. By (17) we have

$$E_{P^n}\left\|\sum \bigl(g(Z_i) - E_P g\bigr)\right\|_H^q \le 2^q\, E_{P^n} E_{\nu^n}\left\|\sum \varepsilon_i\bigl(g(Z_i) - E_P g\bigr)\right\|_H^q. \tag{18}$$

For $1 < q \le 2$, we hence obtain

$$E_{P^n}\|h\|_H^q \le 2^q n^{-q}\, E_{P^n} E_{\nu^n}\left\|\sum \varepsilon_i\bigl(g(Z_i) - E_P g\bigr)\right\|_H^q \le 2^q c_q n^{-q} \sum E_P\|g(Z_i) - E_P g\|_H^q \le 4^q c_q n^{1-q} E_P\|g\|_H^q,$$

where $c_q$ is the type $q$ constant of Hilbert spaces. From this we easily obtain the assertion for $1 < q \le 2$. Now let us assume that $2 < q < \infty$. Then using (18) and Kahane's inequality there is a universal constant $c_q > 0$ with

$$\begin{aligned}
E_{P^n}\|h\|_H^q &\le 2^q n^{-q}\, E_{P^n} E_{\nu^n}\left\|\sum \varepsilon_i\bigl(g(Z_i) - E_P g\bigr)\right\|_H^q \\
&\le c_q n^{-q}\, E_{P^n}\left(E_{\nu^n}\left\|\sum \varepsilon_i\bigl(g(Z_i) - E_P g\bigr)\right\|_H^2\right)^{q/2} \\
&\le c_q n^{-q}\, E_{P^n}\left(\sum \|g(Z_i) - E_P g\|_H^2\right)^{q/2} \\
&\le c_q n^{-q}\left(\sum \bigl(E_P\|g(Z_i) - E_P g\|_H^q\bigr)^{2/q}\right)^{q/2} \\
&\le 2^q c_q n^{-q/2}\, E_P\|g\|_H^q,
\end{aligned}$$

where in the third step we used that Hilbert spaces have type 2 with constant 1. From this estimate we easily obtain the assertion for $2 < q < \infty$. □

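Proof of Theorem 12. To avoid handling too many constants, we assume $\|k\|_\infty = 1$, $|P|_p = 1$, and $c = 2^{-(p+2)}$ for the upper order constant of $L$. Then an easy calculation shows $\mathcal{R}_{L,P}(0) \le 1$. Furthermore, we assume without loss of generality that $\lambda_n \le 1$ for all $n \ge 1$. This implies $\|f_{P,\lambda_n}\|_\infty \le \|f_{P,\lambda_n}\|_H \le \lambda_n^{-1/2}$. Now for $n \in \mathbb{N}$ and regularization parameter $\lambda_n$, let $h_n : X \times Y \to \mathbb{R}$ be the function obtained by Theorem 10. Then our assumptions and (16) give $\|h_n\|_{L_{p'}(P)} \le 3 \cdot 2^{p^*/p - 2}\lambda_n^{-(p-1)/2}$. Moreover, for $g \in H$ with $\|f_{P,\lambda_n} - g\|_H \le 1$ we have $\|g\|_\infty \le \|f_{P,\lambda_n}\|_\infty + \|f_{P,\lambda_n} - g\|_\infty \le 2\lambda_n^{-1/2}$ and, hence, Lemma 20 provides a constant $c_p > 0$ that depends only on $p$ and $L$ such that

$$|\mathcal{R}_{L,P}(f_{P,\lambda_n}) - \mathcal{R}_{L,P}(g)| \le c_p \lambda_n^{-(p-1)/2}\|f_{P,\lambda_n} - g\|_H \tag{19}$$

for all $g \in H$ with $\|f_{P,\lambda_n} - g\|_H \le 1$. Now let $0 < \varepsilon \le 1$ and $D \in (X \times Y)^n$ with corresponding empirical distribution $D$ such that

$$\|E_P h_n\Phi - E_D h_n\Phi\|_H \le c_p^{-1}\lambda_n^{(p+1)/2}\varepsilon. \tag{20}$$

Then Theorem 10 gives $\|f_{P,\lambda_n} - f_{D,\lambda_n}\|_H \le c_p^{-1}\lambda_n^{(p-1)/2}\varepsilon \le 1$ and, hence, (19) yields

$$|\mathcal{R}_{L,P}(f_{P,\lambda_n}) - \mathcal{R}_{L,P}(f_{D,\lambda_n})| \le c_p \lambda_n^{-(p-1)/2}\|f_{P,\lambda_n} - f_{D,\lambda_n}\|_H \le \varepsilon. \tag{21}$$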
Let us now estimate the probability of $D$ satisfying (20). To this end we define $q := p'$ if $p > 1$ and $q := 2$ if $p = 1$. Then we have $q^* := \min\{1/2, 1/q'\} = \min\{1/2, 1/p\} = p/p^*$, and by Lemma 21 and $\|h_n\|_{L_{p'}(P)} \le 3 \cdot 2^{p^*/p - 2}\lambda_n^{-(p-1)/2}$ we obtain

$$P^n\Bigl(D \in (X \times Y)^n : \|E_P h_n\Phi - E_D h_n\Phi\|_H \le c_p^{-1}\lambda_n^{(p+1)/2}\varepsilon\Bigr) \ge 1 - \hat c_p\left(\frac{1}{\varepsilon\,\lambda_n^{p}\, n^{p/p^*}}\right)^q,$$

where $\hat c_p$ is a constant that depends only on $L$ and $p$. Now using $\lambda_n^p n^{p/p^*} = (\lambda_n^{p^*} n)^{p/p^*} \to \infty$, we find that the probability of sample sets $D$ satisfying (20) converges to 1 if $n = |D| \to \infty$. As we have seen above, this implies that (21) holds true with probability tending to 1. Now, since $\lambda_n \to 0$ we additionally have $|\mathcal{R}_{L,P}(f_{P,\lambda_n}) - \mathcal{R}^*_{L,P}| \le \varepsilon$ for all sufficiently large $n$ and hence we finally obtain the assertion. □
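
As a worked illustration of the rate condition used in this proof, consider a polynomially decaying regularization sequence. The schedule $\lambda_n = n^{-\beta}$ is a hypothetical choice made here for illustration, and $p^* = \max\{2p, p^2\}$ is read off from the identity $\min\{1/2, 1/p\} = p/p^*$ above rather than restated from Theorem 12.

```latex
% Assumption: \lambda_n = n^{-\beta} with \beta > 0, and p^* = \max\{2p, p^2\}.
\lambda_n \to 0, \qquad
\lambda_n^{p^*} n = n^{1-\beta p^*} \to \infty
\quad\Longleftrightarrow\quad
\beta < \frac{1}{p^*} = \min\left\{\frac{1}{2p}, \frac{1}{p^{2}}\right\}.

% Examples:
%   p = 1 (Lipschitz-type losses):  p^* = 2, so any 0 < \beta < 1/2 is admissible;
%   p = 2 (least-squares-type):     p^* = 4, so any 0 < \beta < 1/4 is admissible.
```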

To prove Theorem 15 we have to recall some notions from Banach space calculus. To this end, $B_E$ denotes the open unit ball of a Banach space $E$ throughout the rest of this section. We say that a map $G : E \to F$ between Banach spaces $E$ and $F$ is (Fréchet) differentiable in $x_0 \in E$ if there exists a bounded linear operator $A : E \to F$ and a function $\varphi : E \to F$ with $\varphi(x)/\|x\| \to 0$ for $x \to 0$ such that

$$G(x_0 + x) - G(x_0) = Ax + \varphi(x) \tag{22}$$

for all $x \in E$. Furthermore, because $A$ is uniquely determined by (22), we write $G'(x_0) := \frac{\partial G}{\partial E}(x_0) := A$. The map $G$ is called continuously differentiable if the map $x \mapsto G'(x)$ exists on $E$ and is continuous. Analogously we define continuous differentiability on open subsets of $E$. Moreover, we need the following two theorems which can be found, for example, in Akerkar [1] and Cheney [3], respectively.

Theorem 22 (Implicit function theorem). Let $E$ and $F$ be Banach spaces, and let $G : E \times F \to F$ be a continuously differentiable map. Suppose that we have $(x_0, y_0) \in E \times F$ such that $G(x_0, y_0) = 0$ and $\frac{\partial G}{\partial F}(x_0, y_0)$ is invertible. Then there exists a $\delta > 0$ and a continuously differentiable map $f : x_0 + \delta B_E \to y_0 + \delta B_F$ such that for all $x \in x_0 + \delta B_E$, $y \in y_0 + \delta B_F$ we have $G(x, y) = 0$ if and only if $y = f(x)$. Moreover, the derivative of $f$ is given by $f'(x) = -\bigl(\frac{\partial G}{\partial F}(x, f(x))\bigr)^{-1} \frac{\partial G}{\partial E}(x, f(x))$.
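
A one-dimensional sanity check of the derivative formula may help. The sketch below takes $E = F = \mathbb{R}$ and the hypothetical map $G(x, y) = x^2 + y^2 - 1$, whose zero set near $(0.6, 0.8)$ is the graph of $f(x) = \sqrt{1 - x^2}$, and compares the formula $f'(x) = -(\partial G/\partial y)^{-1}\,\partial G/\partial x$ with a central difference quotient.

```python
import math

def G(x, y):
    """G(x, y) = x^2 + y^2 - 1; its zero set is the unit circle."""
    return x * x + y * y - 1.0

def dG_dx(x, y):
    return 2.0 * x

def dG_dy(x, y):
    return 2.0 * y

def f(x):
    """Local solution y = f(x) of G(x, y) = 0 on the upper half of the circle."""
    return math.sqrt(1.0 - x * x)

x0 = 0.6
y0 = f(x0)

# Derivative predicted by the implicit function theorem.
ift_derivative = -dG_dx(x0, y0) / dG_dy(x0, y0)

# Independent check by a central difference quotient of the explicit solution.
h = 1e-6
numeric_derivative = (f(x0 + h) - f(x0 - h)) / (2.0 * h)

print(ift_derivative, numeric_derivative)
```

The invertibility assumption corresponds here to $\partial G/\partial y = 2y_0 \ne 0$; at $(1, 0)$ it fails and no differentiable local solution $y = f(x)$ exists.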

Theorem 23 (Fredholm alternative). Let $E$ be a Banach space and let $S : E \to E$ be a compact operator. Then $\mathrm{id}_E + S$ is surjective if and only if it is injective.

Proof of Theorem 15. The key ingredient of our analysis is the map $G : \mathbb{R} \times H \to H$ defined by $G(\varepsilon, f) := 2\lambda f + E_{(1-\varepsilon)P + \varepsilon\Delta_z} L'(Y, f(X))\Phi(X)$ for all $\varepsilon \in \mathbb{R}$, $f \in H$. Let us first check that its definition makes sense. To this end, recall that every $f \in H$ is a bounded function because we assumed that $H$ has a bounded kernel $k$. As in the proof of Proposition 6, we then find $E_P|L'(Y, f(X))| < \infty$ for all $f \in H$. Because the boundedness of $k$ also ensures that $\Phi$ is a bounded map, we then see that the $H$-valued expectation used in the definition of $G$ is defined for all $\varepsilon \in \mathbb{R}$ and all $f \in H$. (Note that for $\varepsilon \notin [0, 1]$ the $H$-valued expectation is with respect to a signed measure; cf. Dudley [11].) Now for $\varepsilon \in [0, 1]$ we obtain (see Christmann and Steinwart [5] for a detailed derivation)

$$G(\varepsilon, f) = \frac{\partial \mathcal{R}^{reg}_{L,(1-\varepsilon)P + \varepsilon\Delta_z,\lambda}}{\partial H}(f). \tag{23}$$

Given an $\varepsilon \in [0, 1]$, the map $f \mapsto \mathcal{R}^{reg}_{L,(1-\varepsilon)P + \varepsilon\Delta_z,\lambda}(f)$ is convex and continuous by Lemma 20, and hence (23) shows that $G(\varepsilon, f) = 0$ if and only if $f = f_{(1-\varepsilon)P + \varepsilon\Delta_z,\lambda}$. Our aim is to show the existence of a differentiable function $\varepsilon \mapsto f_\varepsilon$ defined on a small interval $(-\delta, \delta)$ for some $\delta > 0$ that satisfies $G(\varepsilon, f_\varepsilon) = 0$ for all $\varepsilon \in (-\delta, \delta)$. Once we have shown the existence of this function, we immediately obtain $IF(z; T, P) = \frac{\partial f_\varepsilon}{\partial \varepsilon}(0)$. For the existence of $\varepsilon \mapsto f_\varepsilon$ we have to check by Theorem 22 that $G$ is continuously differentiable and that $\frac{\partial G}{\partial H}(0, f_{P,\lambda})$ is invertible. Let us start with the first. By an easy calculation we obtain

$$\frac{\partial G}{\partial \varepsilon}(\varepsilon, f) = -E_P L'(Y, f(X))\Phi(X) + E_{\Delta_z} L'(Y, f(X))\Phi(X) \tag{24}$$

and a slightly more involved computation (cf. Christmann and Steinwart [5]) shows

$$\frac{\partial G}{\partial H}(\varepsilon, f) = 2\lambda\, \mathrm{id}_H + E_{(1-\varepsilon)P + \varepsilon\Delta_z} L''(Y, f(X))\langle \Phi(X), \cdot \rangle\Phi(X) =: S. \tag{25}$$

To prove that $\frac{\partial G}{\partial \varepsilon}$ is continuous, we fix an $\varepsilon$ and a convergent sequence $f_n \to f$ in $H$. Because $H$ has a bounded kernel, the sequence of functions $(f_n)$ is then uniformly bounded. By the continuity of $L'$ we thus find a measurable bounded function $g : Y \to \mathbb{R}$ with $|L'(y, f_n(x))| \le |L'(y, g(y))|$ for all $n \ge 1$ and all $(x, y) \in X \times Y$. As in the proof of Proposition 6, we find $(y \mapsto L'(y, g(y))) \in L_1(P)$ and, therefore, an application of Lebesgue's theorem for Bochner integrals gives the continuity of $\frac{\partial G}{\partial \varepsilon}$. Because the continuity of $G$ and $\frac{\partial G}{\partial H}$ can be shown analogously, we obtain that $G$ is continuously differentiable (cf. Akerkar [1]). To show that $\frac{\partial G}{\partial H}(0, f_{P,\lambda})$ is invertible it suffices to show by the Fredholm alternative that $\frac{\partial G}{\partial H}(0, f_{P,\lambda})$ is injective and that $Ag := E_P L''(Y, f_{P,\lambda}(X))g(X)\Phi(X)$, $g \in H$, defines a compact operator on $H$. To show the compactness of the operator $A$, recall that $X$ and $Y$ are Polish spaces (see Dudley [11]), because we assumed that $X$ and $Y$ are closed. Furthermore, Borel probability measures on Polish spaces are regular by Ulam's theorem, that is, they can be approximated from inside by compact sets. In our situation, this means that for all $n \ge 1$ there exists a compact subset $X_n \times Y_n \subset X \times Y$ with $P(X_n \times Y_n) \ge 1 - \frac{1}{n}$. Now we define a sequence of operators $A_n : H \to H$ by

$$A_n g := \int_{X_n}\int_{Y_n} L''(y, f_{P,\lambda}(x))\,P(dy \mid x)\; g(x)\Phi(x)\, dP_X(x) \tag{26}$$

for all $g \in H$. Let us show that $A_n$, $n \ge 1$, is a compact operator. To this end we may assume without loss of generality that $\|k\|_\infty \le 1$. For $g \in B_H$ and $x \in X$, we then have

$$h_g(x) := \int_{Y_n} L''(y, f_{P,\lambda}(x))\,|g(x)|\,P(dy \mid x) \le c\int_{Y_n} \bigl(a(y) + |f_{P,\lambda}(x)|^p + 1\bigr)\,P(dy \mid x) =: h(x)$$

for $a : Y \to \mathbb{R}$, $p \ge 1$ and $c > 0$ according to the $(a, p)$ type of $L''$. Obviously, we have $\|h_g\|_1 \le \|h\|_1 < \infty$ for all $g \in B_H$ and, consequently, $d\mu_g := h_g\, dP_X$ and $d\mu := h\, dP_X$ are finite measures. By Diestel and Uhl ([9], Corollary 8, page 48) we hence obtain

$$A_n g = \int_{X_n} \operatorname{sign} g(x)\,\Phi(x)\, d\mu_g(x) \in \mu(X_n)\,\overline{\operatorname{aco}\Phi(X_n)},$$

where $\operatorname{aco}\Phi(X_n)$ denotes the absolute convex hull of $\Phi(X_n)$, and the closure is with respect to $\|\cdot\|_H$. Now using the continuity of $\Phi$ we see that $\Phi(X_n)$ is compact and, hence, so is the closure of $\operatorname{aco}\Phi(X_n)$. This shows that $A_n$ is a compact operator. To see that $A$ is compact, it therefore suffices to show $\|A_n - A\| \to 0$ for $n \to \infty$. Define $B := (X \times Y) \setminus (X_n \times Y_n)$. Recalling that the convexity of $L$ implies $L'' \ge 0$, the latter convergence follows from $P(X_n \times Y_n) \ge 1 - \frac{1}{n}$, $L''(\cdot, f_{P,\lambda}(\cdot)) \in L_1(P)$ and

$$\|A_n g - Ag\| \le \int_B L''(y, f_{P,\lambda}(x))\,|g(x)|\,\|\Phi(x)\|\, dP(x, y) \le \|g\|_H \int_B L''(y, f_{P,\lambda}(x))\, dP(x, y),$$

where again we assumed without loss of generality that $\|k\|_\infty \le 1$. Let us now show that $\frac{\partial G}{\partial H}(0, f_{P,\lambda}) = 2\lambda\, \mathrm{id}_H + A$ is injective. To this end, let us choose $g \in H$ with $g \ne 0$. Then we find

$$\langle (2\lambda\, \mathrm{id}_H + A)g, (2\lambda\, \mathrm{id}_H + A)g \rangle > 4\lambda\langle g, Ag \rangle = 4\lambda E_P L''(Y, f_{P,\lambda}(X))g^2(X) \ge 0,$$

which shows the injectivity. As already described, we can now apply the implicit function theorem to see that $\varepsilon \mapsto f_\varepsilon$ is differentiable on $(-\delta, \delta)$. Furthermore, (24) and (25) yield

$$IF(z; T, P) = \frac{\partial f_\varepsilon}{\partial \varepsilon}(0) = S^{-1}\bigl(E_P(L'(Y, f_{P,\lambda}(X))\Phi(X))\bigr) - L'(y, f_{P,\lambda}(x))\,S^{-1}\Phi(x). \qquad □$$
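
The influence function formula can be checked numerically in the simplest possible setting. The sketch below is an illustration under several assumptions that are not taken from the paper: $X \subset \mathbb{R}$ with the linear kernel $k(x, x') = xx'$ (so $\Phi(x) = x$ and $f(x) = wx$ with a scalar $w$), $P$ an empirical distribution on toy data $y = x$, and the logistic loss $l(r) = -\log(4\Lambda(r)(1 - \Lambda(r)))$, so that $L'(y, t) = -\tanh((y - t)/2)$. It compares the formula with a difference quotient of $\varepsilon \mapsto w_\varepsilon$.

```python
import math

LAM = 0.05                                  # regularization parameter (illustrative)
XS = [0.5 * i for i in range(-10, 11)]      # toy inputs
YS = XS[:]                                  # toy outputs, y = x

def dl(r):
    """l'(r) = tanh(r/2) for the assumed logistic loss."""
    return math.tanh(r / 2.0)

def d2l(r):
    """l''(r) = (1 - tanh(r/2)^2) / 2."""
    return 0.5 * (1.0 - math.tanh(r / 2.0) ** 2)

def G(eps, w, z):
    """G(eps, w) = 2*lam*w + E_{(1-eps)P + eps*Delta_z} L'(Y, wX) X, with L'(y, t) = -l'(y - t)."""
    xz, yz = z
    e_p = sum(-dl(y - w * x) * x for x, y in zip(XS, YS)) / len(XS)
    e_z = -dl(yz - w * xz) * xz
    return 2.0 * LAM * w + (1.0 - eps) * e_p + eps * e_z

def solve_w(eps, z):
    """Root of the increasing map w -> G(eps, w), found by bisection (eps in [0, 1])."""
    bound = max(max(abs(x) for x in XS), abs(z[0])) / (2.0 * LAM)
    lo, hi = -bound, bound
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if G(eps, mid, z) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def influence(z):
    """1-D version of IF(z) = S^{-1} E_P[L'(Y, wX) X] - L'(y, wx) S^{-1} x."""
    xz, yz = z
    w = solve_w(0.0, z)
    S = 2.0 * LAM + sum(d2l(y - w * x) * x * x for x, y in zip(XS, YS)) / len(XS)
    e_p = sum(-dl(y - w * x) * x for x, y in zip(XS, YS)) / len(XS)
    return (e_p + dl(yz - w * xz) * xz) / S

z = (100.0, 0.0)                            # contamination point, a bad leverage point
eps = 1e-6
if_formula = influence(z)
if_numeric = (solve_w(eps, z) - solve_w(0.0, z)) / eps

print(if_formula, if_numeric)
```

Because $|l'| \le 1$, the value of `influence` stops changing once the $y$ component of $z$ is far from the fit, which is the bounded-influence behaviour in the $y$ direction. The linear kernel used here is bounded only on a bounded input set, so this sketch checks the formula rather than boundedness in $x$.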

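Proof of Theorem 16. By Theorem 10, there exists an $h \in L_\infty(P)$ such that

$$\|f_{P,\lambda} - f_{(1-\varepsilon)P + \varepsilon\tilde P,\lambda}\|_H \le \varepsilon\lambda^{-1}\|E_P h\Phi - E_{\tilde P} h\Phi\|_H \le \varepsilon\lambda^{-1}\|k\|_\infty\bigl(\|h\|_{L_1(\tilde P)} + \|h\|_{L_1(P)}\bigr).$$

Because $h$ is independent of $\varepsilon$ and $\tilde P$, we then obtain the assertion by (13). □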

Proof of Theorem 17. Using the notation of the previous proof and $\|f_{P,\lambda}\|_\infty \le \|k\|_\infty\sqrt{\mathcal{R}_{L,P}(0)/\lambda}$, we obtain analogously to the proof of Theorem 10 that

$$|h(x, y)| \le c\bigl(|y|^{p-1} + |f_{P,\lambda}(x)|^{p-1} + 1\bigr) \le \hat c\bigl(|y|^{p-1} + \|k\|_\infty^{p-1}\mathcal{R}_{L,P}(0)^{(p-1)/2}\lambda^{-(p-1)/2} + 1\bigr),$$

where $c, \hat c > 0$ are constants that depend only on $L$ and $p$. We then obtain the first assertion by combining the above estimate with $\|f_{P,\lambda} - f_{(1-\varepsilon)P + \varepsilon\tilde P,\lambda}\|_H \le \varepsilon\lambda^{-1}\|E_P h\Phi - E_{\tilde P} h\Phi\|_H \le \varepsilon\lambda^{-1}\|k\|_\infty E_{|P - \tilde P|}|h|$. The second assertion can be shown analogously using $\|h\|_\infty \le |L|_1$. Finally, the third assertion is a direct consequence of the second assertion and (8). □